Abstract

The HPCChallenge suite of benchmarks will examine the performance of HPC
architectures using kernels with memory access patterns more challenging than
those of the High Performance Linpack (HPL) benchmark used in the Top500 list.
The HPCChallenge suite is being designed to augment the Top500 list, provide
benchmarks that bound the performance of many real applications as a function
of memory access characteristics, e.g., spatial and temporal locality, and
provide a framework for including additional benchmarks. The HPCChallenge
benchmarks are scalable, with the size of data sets being a function of the
largest HPL matrix for a system. The HPCChallenge benchmark suite has been
released by the DARPA HPCS program to help define the performance boundaries
of future Petascale computing systems. The suite is composed of several well
known computational kernels (STREAM, High Performance Linpack, matrix
multiply - DGEMM, matrix transpose, FFT, RandomAccess, and bandwidth/latency
tests) that attempt to span high and low spatial and temporal locality space.

1  High Productivity Computing Systems

The DARPA High Productivity Computing Systems (HPCS) [1] program is focused on
providing a new generation of economically viable high productivity computing
systems for national security and for the industrial user community. HPCS
program researchers have initiated a fundamental reassessment of how we define
and measure performance, programmability, portability, robustness and,
ultimately, productivity in the HPC domain.

The HPCS program seeks to create trans-Petaflop systems of significant value
to the Government HPC community. Such value will be determined by assessing
many additional factors beyond just theoretical peak flops (floating-point
operations). Ultimately, the goal is to decrease the time-to-solution, which
means decreasing both the execution time and development time of an
application on a particular system. Evaluating the capabilities of a system
with respect to these goals requires a different assessment process. The goal
of the HPCS assessment activity is to prototype and baseline a process that
can be transitioned to the acquisition community for 2010 procurements.

The most novel part of the assessment activity will be the effort to measure
or predict the ease or difficulty of developing HPC applications. Currently,
there is no quantitative methodology for comparing the development time impact
of various HPC programming technologies. To achieve this goal, the HPCS
program is using a variety of tools including:

•  Application of code metrics on existing HPC codes,

•  Several prototype analytic models of development time,

•  Interface characterization (e.g. programming language, parallel model,
   memory model, communication model),

•  Scalable benchmarks designed for testing both performance and
   programmability,

•  Classroom software engineering experiments,

•  Human validated demonstrations.

These tools will provide the baseline data necessary for modeling development
time and allow the new technologies developed under HPCS to be assessed
quantitatively.

As part of this effort we are developing a scalable benchmark for the HPCS
systems.

The basic goal of performance modeling is to measure, predict, and understand
the performance of a computer program or set of programs on a computer system.
The applications of performance modeling are numerous, including evaluation of
algorithms, optimization of code implementations, parallel library
development, comparison of system architectures, parallel system design, and
procurement of new systems.

2  Motivation 

The DARPA High Productivity Computing Systems (HPCS) program has initiated a
fundamental reassessment of how we define and measure performance,
programmability, portability, robustness and, ultimately, productivity in the
HPC domain. With this in mind, a set of kernels was needed to test and rate a
system. The HPCChallenge suite of benchmarks consists of four local
(matrix-matrix multiply, STREAM, RandomAccess and FFT) and four global (High
Performance Linpack - HPL, parallel matrix transpose - PTRANS, RandomAccess
and FFT) kernel benchmarks. HPCChallenge is designed to approximately bound
computations of high and low spatial and temporal locality (see Figure 1). In
addition, because HPCChallenge kernels consist of simple mathematical
operations, this provides a unique opportunity to look at language and
parallel programming model issues. In the end, the benchmark is to serve both
the system user and designer communities [2].

3  The  Benchmark  Tests 

The first phase of the project has developed, hardened, and reported on a
number of benchmarks. The collection of tests includes tests on a single
processor (local) and tests over the complete system (global). In particular,
to characterize the architecture of the system we consider three testing
scenarios:

1.  Local - only a single processor is performing computations.

2.  Embarrassingly Parallel - each processor in the entire system is
    performing computations but they do not communicate with each other
    explicitly.

3.  Global - all processors in the system are performing computations and
    they explicitly communicate with each other.

The HPCChallenge benchmark consists at this time of 7 performance tests:
HPL [3], STREAM [4], RandomAccess, PTRANS, FFT (implemented using FFTE [5]),
DGEMM [6, 7] and b_eff Latency/Bandwidth [8, 9, 10]. HPL is the Linpack TPP
(toward peak performance) benchmark. The test stresses the floating point
performance of a system. STREAM is a benchmark that measures sustainable
memory bandwidth (in GB/s). RandomAccess measures the rate of random updates
of memory. PTRANS measures the rate of transfer for large arrays of data from
the multiprocessor's memory. Latency/Bandwidth measures (as the name suggests)
latency and bandwidth of communication patterns of increasing complexity
between as many nodes as is time-wise feasible.

Many of the aforementioned tests were widely used before HPCChallenge was
created. At first glance, this may seem to make our benchmark merely a
packaging effort. However, almost all components of HPCChallenge were
augmented from their original form to provide a consistent verification and
reporting scheme. We should also stress the importance of running these very
tests on a single machine and having the results available at once. The tests
were useful separately for the HPC community before, and with the unified
HPCChallenge framework they create an unprecedented view of performance
characterization of a system - a comprehensive view that captures the data
under the same conditions and allows for a variety of analyses depending on
end user needs.

Figure 1: Targeted application areas in the memory access locality space.

Each of the included tests examines system performance for various points of
the conceptual spatial and temporal locality space shown in Figure 1. The
rationale for such a selection of tests is to measure performance bounds on
metrics important to HPC applications. The expected behavior of the
applications is to go through various locality space points during runtime.
Consequently, an application may be represented as a point in the locality
space being an average (possibly time-weighted) of its various locality
behaviors. Alternatively, a decomposition can be made into time-disjoint
periods in which the application exhibits a single locality characteristic.
The application's performance is then obtained by combining the partial
results from each period.
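The two views of an application described above can be illustrated with a
short sketch. It is only an illustration under stated assumptions: the phase
durations, the locality scores in [0, 1], and the function names are
hypothetical and are not part of HPCChallenge, and treating the combined
performance as total work divided by total time is one plausible reading of
"combining the partial results", not a rule taken from the benchmark.

```python
# Hypothetical phases of one application: (seconds, spatial, temporal).
# Locality scores are assumed to lie in [0, 1]; all values are made up.
phases = [
    (6.0, 0.9, 0.8),   # phase with high spatial and temporal locality
    (3.0, 0.9, 0.1),   # streaming-like phase
    (1.0, 0.1, 0.1),   # random-update-like phase
]

def average_locality(phases):
    """Time-weighted average point of the phases in the locality space."""
    total = sum(t for t, _, _ in phases)
    spatial = sum(t * s for t, s, _ in phases) / total
    temporal = sum(t * p for t, _, p in phases) / total
    return spatial, temporal

def combined_rate(work_and_rate):
    """Overall rate for time-disjoint periods, each running at its own rate.

    work_and_rate holds (work, rate) pairs. The time of a period is
    work / rate, so the overall rate is total work over total time.
    """
    total_work = sum(w for w, _ in work_and_rate)
    total_time = sum(w / r for w, r in work_and_rate)
    return total_work / total_time
```

Under these assumptions, a period that runs at a low rate dominates the
combined result even when it accounts for a small share of the work.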

Another aspect of performance assessment addressed by HPCChallenge is the
ability to optimize benchmark code. For that we allow two different runs to be
reported:

•  Base run done with the provided reference implementation.

•  Optimized run that uses architecture specific optimizations.

The base run, in a sense, represents behavior of legacy code because it is
conservatively written using only widely available programming languages and
libraries. It reflects a commonly used approach to parallel processing,
sometimes referred to as hierarchical parallelism, that combines Message
Passing Interface (MPI) with threading from OpenMP. At the same time we
recognize the limitations of the base run and hence we allow (or even
encourage) optimized runs to be made. The optimizations may include
alternative implementations in different programming languages using parallel
environments available specifically on the tested system. To stress the
productivity aspect of the HPCChallenge benchmark, we require that the
information about the changes made to the original code be submitted together
with the benchmark results. While we understand that full disclosure of
optimization techniques may sometimes be impossible to obtain (due to, for
example, trade secrets), we ask at least for some guidance for the users that
would like to use similar optimizations in their applications.

4  Benchmark  Details 

Almost all tests included in our suite operate on either matrices or vectors.
The size of the former we will denote below as n and the latter as m. The
following holds throughout the tests:

$n^2 \simeq m \simeq \text{Available Memory}$

In other words, the data for each test is scaled so that the matrices or
vectors are large enough to fill almost all available memory.
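As a rough illustration of this scaling rule, the sketch below derives a
matrix order n and a vector length m from a memory budget. It is a minimal
sketch under stated assumptions: the function name and the `fill_fraction`
safety margin are hypothetical, each object is assumed to hold 64-bit values,
and the sketch ignores that some tests keep several arrays in memory at once.

```python
import math

BYTES_PER_DOUBLE = 8  # 64-bit floating-point values

def problem_sizes(available_bytes, fill_fraction=0.8):
    """Return (n, m): matrix order and vector length that fit the budget.

    fill_fraction is an assumed safety margin, not a value from the text.
    """
    budget = int(available_bytes * fill_fraction) // BYTES_PER_DOUBLE
    n = math.isqrt(budget)   # an n-by-n matrix holds n**2 entries
    m = budget               # a vector holds m entries
    return n, m
```

For example, `problem_sizes(16 * 10**6, 0.5)` budgets one million entries,
which corresponds to a matrix of order 1000 or a vector of length 1000000.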

HPL is the Linpack TPP (Toward Peak Performance) variant of the original
Linpack benchmark which measures the floating point rate of execution for
solving a linear system of equations. HPL solves a linear system of equations
of order n:

$Ax = b; \quad A \in \mathbb{R}^{n \times n}; \quad x, b \in \mathbb{R}^n$

by first computing LU factorization with row partial pivoting of the n by
n + 1 coefficient matrix:

$P[A, b] = [[L, U], y].$

Since the row pivoting (represented by the permutation matrix P) and the lower
triangular factor L are applied to b as the factorization progresses, the
solution x is obtained in one step by solving the upper triangular system:

$Ux = y.$

The lower triangular matrix L is left unpivoted and the array of pivots is not
returned. The operation count for the factorization phase is
$\frac{2}{3}n^3 - \frac{1}{2}n^2$ and $2n^2$ for the solve phase. Correctness
of the solution is ascertained by calculating scaled residuals:


$\frac{\|Ax - b\|_\infty}{\varepsilon \|A\|_1 n},$

$\frac{\|Ax - b\|_\infty}{\varepsilon \|A\|_1 \|x\|_1},$

and

$\frac{\|Ax - b\|_\infty}{\varepsilon \|A\|_\infty \|x\|_\infty},$

where $\varepsilon$ is machine precision for 64-bit floating-point values.

DGEMM measures the floating point rate of execution of double precision real
matrix-matrix multiplication. The exact operation performed is:

$C \leftarrow \beta C + \alpha A B$

where:

$A, B, C \in \mathbb{R}^{n \times n}; \quad \alpha, \beta \in \mathbb{R}.$

The operation count for the multiply is $2n^3$ and correctness of the
operation is ascertained by calculating a scaled residual of $C$ against
$\hat{C}$ ($\hat{C}$ is the result of a reference implementation of the
multiplication).

STREAM is a simple synthetic benchmark program that measures sustainable
memory bandwidth (in GB/s) and the corresponding computation rate for four
simple vector kernels:

COPY:   $c \leftarrow a$
SCALE:  $b \leftarrow \alpha c$
ADD:    $c \leftarrow a + b$
TRIAD:  $a \leftarrow b + \alpha c$

where:

$a, b, c \in \mathbb{R}^m; \quad \alpha \in \mathbb{R}.$

As mentioned earlier, we try to operate on large data objects. The size of
these objects is determined at runtime, which contrasts with the original
version of the STREAM benchmark, which uses static storage (determined at
compile time) and size. The original benchmark gives the compiler more
information (and control) over data alignment, loop trip counts, etc. The
benchmark measures GB/s and the number of items transferred is either 2m or
3m depending on the operation. The norm of the difference between reference
and computed vectors is used to verify the result: $\|x - \hat{x}\|$.

PTRANS (parallel matrix transpose) exercises the communications where pairs
of processors communicate with each other simultaneously. It is a useful test
of the total communications capacity of the network. The performed operation
sets a random n by n matrix to a sum of its transpose with another random
matrix:

$A \leftarrow A^T + B$

where:

$A, B \in \mathbb{R}^{n \times n}.$

The data transfer rate (in GB/s) is calculated by dividing the size of $n^2$
matrix entries by the time it took to perform the transpose. A scaled residual
comparing the result with a reference result $\hat{A}$ verifies the
calculation.
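The operation counts and transfer sizes quoted in this section for HPL,
DGEMM, STREAM, and PTRANS translate into reported rates by simple arithmetic.
The sketch below shows that arithmetic only. The function names are
hypothetical and are not taken from the HPCChallenge code, and 8 bytes per
entry is assumed because the tests operate on 64-bit floating-point values.

```python
BYTES_PER_DOUBLE = 8  # assumed size of one matrix or vector entry

def hpl_gflops(n, seconds):
    """HPL rate from the factorization plus solve operation counts."""
    flops = (2.0 / 3.0) * n**3 - 0.5 * n**2 + 2.0 * n**2
    return flops / seconds * 1e-9

def dgemm_gflops(n, seconds):
    """DGEMM rate from the 2*n**3 operation count."""
    return 2.0 * n**3 / seconds * 1e-9

def stream_gbs(kernel, m, seconds):
    """STREAM bandwidth: Copy and Scale move 2m items, Add and Triad 3m."""
    vectors = {"copy": 2, "scale": 2, "add": 3, "triad": 3}[kernel]
    return BYTES_PER_DOUBLE * vectors * m / seconds * 1e-9

def ptrans_gbs(n, seconds):
    """PTRANS transfer rate: n**2 matrix entries over the elapsed time."""
    return BYTES_PER_DOUBLE * n**2 / seconds * 1e-9
```

For instance, a Triad kernel on vectors of length 10**9 that takes 24 seconds
corresponds to 1 GB/s under these assumptions.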

RandomAccess measures the rate of integer random updates of memory (GUPS).
The operation being performed on an integer array of size m is:

$x \leftarrow f(x)$

$f: x \mapsto (x \oplus a_i); \quad a_i$ - pseudo-random sequence

where:

$f: \mathbb{Z}^m \rightarrow \mathbb{Z}^m; \quad x \in \mathbb{Z}^m.$

The operation count is m and since all the operations are in integral values
using Galois field they can be checked exactly with a reference
implementation. The verification procedure allows 1% of the operations to be
incorrect (either skipped or done in the wrong order) which allows loosening
concurrent memory update semantics on shared memory architectures.
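The exact check is possible because an XOR update is its own inverse:
replaying the same update sequence restores the initial table, and any entry
that differs afterwards marks a lost or misapplied update. The sketch below
illustrates this with a single sequential update stream. It is a simplified
illustration, not the benchmark's implementation: the function names are
hypothetical, the polynomial value 7 and the table initialization are taken
from the reference code in Figure 6, and the seed of 1 is an assumption.

```python
MASK64 = (1 << 64) - 1
POLY = 0x7  # feedback value used by the reference code in Figure 6

def next_random(ran):
    """Advance the 64-bit shift-register sequence by one step."""
    feedback = POLY if ran & (1 << 63) else 0
    return ((ran << 1) & MASK64) ^ feedback

def apply_updates(table, count, seed=1):
    """XOR `count` pseudo-random values into a power-of-two sized table."""
    ran = seed
    for _ in range(count):
        ran = next_random(ran)
        table[ran & (len(table) - 1)] ^= ran
    return table

def error_percentage(table):
    """Share of entries that differ from the initial state table[i] == i."""
    wrong = sum(1 for i, value in enumerate(table) if value != i)
    return 100.0 * wrong / len(table)

m = 1024
table = apply_updates(list(range(m)), 4 * m)   # pass being checked
table = apply_updates(table, 4 * m)            # replay undoes every update
assert error_percentage(table) <= 1.0          # tolerance from the rules
```

In a sequential run the replay leaves no errors at all. The 1% allowance
matters only when concurrent updates to the same entry may be lost.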

FFT measures the floating point rate of execution of double precision complex
one-dimensional Discrete Fourier Transform (DFT) of size m:

$Z_k \leftarrow \sum_{j}^{m} z_j e^{-2\pi i \frac{jk}{m}}; \quad 1 \le k \le m$

where:

$z, Z \in \mathbb{C}^m.$

The operation count is taken to be $5m \log_2 m$ for the calculation of the
computational rate (in GFlop/s). Verification is done with a scaled residual
of $\|x - \hat{x}\|$, where $\hat{x}$ is the result of applying a reference
implementation of the inverse transform to the outcome of the benchmarked code
(in infinite-precision arithmetic the residual should be zero).
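To make the definition and the verification step concrete, the sketch below
evaluates the DFT sum directly and checks the round trip through the inverse
transform. It is an illustration only: it uses the direct O(m^2) sum rather
than a fast algorithm, indexes from 0 instead of 1, reports an unscaled
maximum-norm residual, and all function names are hypothetical.

```python
import cmath
import math

def dft(z, sign=-1):
    """Direct evaluation of Z_k = sum_j z_j * exp(sign * 2*pi*i * j*k / m)."""
    m = len(z)
    return [sum(z[j] * cmath.exp(sign * 2j * cmath.pi * j * k / m)
                for j in range(m))
            for k in range(m)]

def fft_gflops(m, seconds):
    """Rate based on the nominal 5 * m * log2(m) operation count."""
    return 5.0 * m * math.log2(m) / seconds * 1e-9

z = [complex(j % 3, -j) for j in range(8)]          # arbitrary test input
Z = dft(z)                                          # forward transform
x_hat = [v / len(z) for v in dft(Z, sign=+1)]       # inverse transform
residual = max(abs(a - b) for a, b in zip(z, x_hat))
```

In floating-point arithmetic the residual is small but generally not zero,
which is why the benchmark scales it before comparing against a threshold.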

Communication bandwidth and latency is a set of tests to measure latency and
bandwidth of a number of simultaneous communication patterns. The patterns are
based on b_eff (effective bandwidth benchmark), although they are slightly
different from the original b_eff. The operation count is linearly dependent
on the number of processors in the tested system and the time the tests take
depends on the parameters of the tested network. The checks are built into
the benchmark code by checking data after it has been received.

5  Rules for Running the Benchmark

There must be one baseline run submitted for each computer system entered in
the archive. There may also exist an optimized run for each computer system.

1.  Baseline Runs

    Optimizations as described below are allowed.

    (a)  Compile and load options

         Compiler or loader flags which are supported and documented by the
         supplier are allowed. These include porting, optimization, and
         preprocessor invocation.

    (b)  Libraries

         Linking to optimized versions of the following libraries is allowed:

         •  BLAS

         •  MPI

         Acceptable use of such libraries is subject to the following rules:

         •  All libraries used shall be disclosed with the results
            submission. Each library shall be identified by library name,
            revision, and source (supplier). Libraries which are not
            generally available are not permitted unless they are made
            available by the reporting organization within 6 months.

         •  Calls to library subroutines should have equivalent functionality
            to that in the released benchmark code. Code modifications to
            accommodate various library call formats are not allowed.

         •  Only complete benchmark output may be submitted - partial results
            will not be accepted.

2.  Optimized Runs

    (a)  Code modification

         Provided that the input and output specification is preserved, the
         following routines may be substituted:

         •  In HPL: HPL_pdgesv(), HPL_pdtrsv() (factorization and
            substitution functions)

         •  no changes are allowed in the DGEMM component

         •  In PTRANS: pdtrans()

         •  In STREAM: tuned_STREAM_Copy(), tuned_STREAM_Scale(),
            tuned_STREAM_Add(), tuned_STREAM_Triad()

         •  In RandomAccess: MPIRandomAccessUpdate() and
            RandomAccessUpdate()

         •  In FFT: fftw_malloc(), fftw_free(), fftw_create_plan(),
            fftw_one(), fftw_destroy_plan(), fftw_mpi_create_plan(),
            fftw_mpi_local_sizes(), fftw_mpi(), fftw_mpi_destroy_plan()
            (all these functions are compatible with FFTW 2.1.5 [11, 12] so
            the benchmark code can be directly linked against FFTW 2.1.5 by
            only adding proper compiler and linker flags, e.g.
            -DUSING_FFTW)

         •  In Latency/Bandwidth component alternative MPI routines might be
            used for communication. But only standard MPI calls are to be
            performed and only to the MPI library that is widely available on
            the tested system.

Processor type                Speed     Count  G-HPL    G-PTRANS  G-Random  EP-STREAM  G-FFTE   EP-DGEMM  Random Ring  Random Ring
                                                                  Access    Triad                         Bandwidth    Latency
                                               TFlop/s  GB/s      Gup/s     GB/s       GFlop/s  GFlop/s   GB/s         usec
Cray Alpha 21164              0.6GHz    1024   0.0482   10.277    -         0.517      -        -         0.03174      12.09
Cray Alpha 21164              0.675GHz  512    0.2232   9.774     0.028946  0.532      15.477   0.661     0.03571      8.14
HP Alpha 21264B               1GHz      128    0.1905   1.507     -         0.803      -        -         0.02785      37.31
HP (Compaq) Alpha 21264B      1GHz      484    0.6181   3.739     -         1.389      -        -         0.02269      39.91
Hewlett-Packard Alpha 21264B  0.833GHz  484    0.4337   5.029     0.006283  0.791      4.509    1.045     0.01729      50.10
Hewlett-Packard Alpha 21264C  1GHz      484    0.5805   6.370     0.008090  1.303      5.008    1.218     0.02260      39.63
Atipa AMD Opteron             1.4GHz    128    0.2526   3.247     -         1.629      -        -         0.03627      23.68
Dalco AMD Opteron             2.2GHz    64     0.2180   6.320     0.004700  2.397      13.548   3.879     0.17003      11.46
Cray AMD Opteron              2.2GHz    64     0.2239   10.592    0.022397  2.656      16.361   4.034     0.22697      1.63
Cray X1 MSP                   0.8GHz    64     0.5216   3.229     -         14.990     -        -         0.94074      20.34
Cray X1 MSP                   0.8GHz    60     0.5778   30.431    -         14.974     -        -         1.03291      20.83
Cray X1 MSP                   0.8GHz    120    1.0610   2.460     -         8.496      -        -         0.83014      20.12
Cray X1 MSP                   0.8GHz    252    2.3847   97.408    -         14.914     -        -         0.42899      22.27
Cray X1 MSP                   0.8GHz    124    1.2054   39.525    -         14.973     -        -         0.70857      20.15
Cray X1 MSP                   0.8GHz    60     0.5087   1.634     0.003075  14.902     3.144    10.915    1.16779      14.66
Cray X1 MSP                   0.8GHz    32     0.2767   32.661    0.001662  14.870     2.965    8.258     1.41269      14.94

Figure 2: Sample results page.

    (b)  Limitations of Optimization

         i.  Code with limited calculation accuracy

             The calculation should be carried out in full precision (64-bit
             or the equivalent). However, the substitution of algorithms is
             allowed (see Exchange of the used mathematical algorithm).

         ii.  Exchange of the used mathematical algorithm

             Any change of algorithms must be fully disclosed and is subject
             to review by the HPC Challenge Committee. Passing the
             verification test is a necessary condition for such an approval.
             The substituted algorithm must be as robust as the baseline
             algorithm. For the matrix multiply in the HPL benchmark, Strassen
             Algorithm may not be used as it changes the operation count of
             the algorithm.

         iii.  Using the knowledge of the solution

             Any modification of the code or input data sets, which uses the
             knowledge of the solution or of the verification test, is not
             permitted.

         iv.  Code to circumvent the actual computation

             Any modification of the code to circumvent the actual
             computation is not permitted.

6  Software Download, Installation, and Usage

The reference implementation of the benchmark may be obtained free of charge
at the benchmark's web site: http://icl.cs.utk.edu/hpcc/. The reference
implementation should be used for the base run. The installation of the
software requires creating a script file for Unix's make(1) utility. The
distribution archive comes with script files for many common computer
architectures. Usually, a few changes to one of these files will produce the
script file for a given platform.

After a successful compilation, the benchmark is ready to run. However, it is
recommended that changes be made to the benchmark's input file that describes
the sizes of data to use during the run. The sizes should reflect the
available memory on the system and the number of processors available for
computations.

Figure 3: Sample kiviat diagram of results for two generations of hardware
from the same vendor with different numbers of threads per MPI node.

We have collected a comprehensive set of notes on the HPCChallenge benchmark.
They can be found at http://icl.cs.utk.edu/hpcc/faq/.

7  Example  Results 

Figure 2 shows a sample rendering of the results web page:
http://icl.cs.utk.edu/hpcc/hpcc_results.cgi. Figure 3 shows a sample kiviat
diagram generated using the benchmark results.

8  Conclusions 

No single test can accurately compare the performance of HPC systems. The
HPCChallenge benchmark test suite stresses not only the processors, but also
the memory system and the interconnect. It is a better indicator of how an
HPC system will perform across a spectrum of real-world applications. Now
that the more comprehensive, informative HPCChallenge benchmark suite is
available, it can be used in preference to comparisons and rankings based on
single tests. The real utility of the HPCChallenge benchmarks is that
architectures can be described with a wider range of metrics than just Flop/s
from HPL. When looking only at HPL performance and the Top500 List,
inexpensive build-your-own clusters appear to be much more cost effective
than more sophisticated HPC architectures. Even a small percentage of random
memory accesses in real applications can significantly affect the overall
performance of that application on architectures not designed to minimize or
hide memory latency. HPCChallenge benchmarks provide users with additional
information to justify policy and purchasing decisions. We expect to expand
and perhaps remove some existing benchmark components as we learn more about
the collection.
B  Reference Sequential Implementation

Figures 4, 5, 6, 7, 8, and 9 show reference implementations of the tests from
the HPCChallenge suite. Python was chosen (as opposed to, say, Matlab) to show
that the tests can be easily implemented in a popular general purpose
language.

    import numarray, time
    import numarray.random_array as naRA
    import numarray.linear_algebra as naLA
    n = 1000
    a = naRA.random([n, n])
    b = naRA.random([n, 1])
    # time the solve of a x = b
    t = -time.time()
    x = naLA.solve_linear_equations(a, b)
    t += time.time()
    # residual and its maximum norm
    r = numarray.dot(a, x) - b
    r_n = numarray.maximum.reduce(abs(r))
    print t, 2.0e-9 / 3.0 * n**3 / t
    print r_n, r_n / (n * 1e-16)

Figure 4: Python code implementing Linpack benchmark.
    import numarray, time
    import numarray.random_array as naRA
    import numarray.linear_algebra as naLA
    m = 1000
    a = naRA.random([m, 1])
    alpha = naRA.random([1, 1])[0]
    Copy, Scale = "Copy", "Scale"
    Add, Triad = "Add", "Triad"
    td = {}
    td[Copy] = -time.time()
    c = a[:]
    td[Copy] += time.time()
    td[Scale] = -time.time()
    b = alpha * c
    td[Scale] += time.time()
    td[Add] = -time.time()
    c = a + b
    td[Add] += time.time()
    td[Triad] = -time.time()
    a = b + alpha * c
    td[Triad] += time.time()
    for op in (Copy, Scale, Add, Triad):
        t = td[op]
        # Copy and Scale touch 2 vectors, Add and Triad touch 3
        s = op[0] in ("C", "S") and 2 or 3
        print op, t, 8.0e-9 * s * m / t

Figure 5: Python code implementing STREAM benchmark.


    from time import time
    from numarray import *
    m = 1024
    table = zeros([m], UInt64)
    ran = zeros([128], UInt64)
    mupdate = 4 * m
    POLY, PERIOD = 7, 1317624576693539401L

    def starts(n):
        n = array([n], Int64)
        m2 = zeros([64], UInt64)
        while (n[0] < 0): n += PERIOD
        while (n[0] > PERIOD): n -= PERIOD
        if (n[0] == 0): return 1
        temp = array([1], UInt64)
        for i in range(64):
            m2[i] = temp[0]
            for j in range(2):
                v = 0
                if temp.astype(Int64)[0] < 0: v = POLY
                temp = (temp << 1) ^ v
        for i in range(62, -1, -1):
            if ((n >> i) & 1)[0]: break
        ran = array([2], UInt64)
        while (i > 0):
            temp[0] = 0
            for j in range(64):
                if ((ran >> j) & 1)[0]: temp ^= m2[j]
            ran[0] = temp[0]
            i -= 1
            if ((n >> i) & 1)[0]:
                v = 0
                if ran.astype(Int64)[0] < 0: v = POLY
                ran = (ran << 1) ^ v
        return ran[0]

    # timed update phase using 128 independent random streams
    t = -time()
    for i in range(m): table[i] = i
    for j in range(128):
        ran[j] = starts(mupdate / 128 * j)
    for i in range(mupdate / 128):
        for j in range(128):
            v = 0
            if ran.astype(Int64)[j] < 0: v = POLY
            ran[j] = (ran[j] << 1) ^ v
            table[ran[j] & (m - 1)] ^= ran[j]
    t += time()

    # verification: replay the updates sequentially and count errors
    temp = array([1], UInt64)
    for i in range(mupdate):
        v = 0
        if temp.astype(Int64)[0] < 0: v = POLY
        temp = (temp << 1) ^ v
        table[temp & (m - 1)] ^= temp
    temp = 0
    for i in range(m):
        if table[i] != i: temp += 1
    print t, 100.0 * temp / m

Figure 6: Python code implementing RandomAccess benchmark.
    import numarray, time
    import numarray.random_array as naRA
    import numarray.linear_algebra as naLA
    n = 1000
    a = naRA.random([n, n])
    b = naRA.random([n, n])
    t = -time.time()
    a = numarray.transpose(a) + b
    t += time.time()
    print t, 8e-9 * n**2 / t

Figure 7: Python code implementing PTRANS benchmark.

    import numarray, numarray.fft, time, math
    import numarray.random_array as naRA
    m = 1024
    a = naRA.random([m, 1])
    t = -time.time()
    b = numarray.fft.fft(a)
    t += time.time()
    # residual of the round trip through the inverse transform
    r = a - numarray.fft.inverse_fft(b)
    r_n = numarray.maximum.reduce(abs(r))
    # operation count is 5 m log2(m)
    print t, 5e-9 * m * math.log(m, 2) / t, r_n

Figure 8: Python code implementing FFT benchmark.


    import numarray, time
    import numarray.random_array as naRA
    n = 1000
    a = naRA.random([n, n])
    b = naRA.random([n, n])
    c = naRA.random([n, n])
    alpha = a[n/2, 0]
    beta = b[n/2, 0]
    t = -time.time()
    c = beta * c + alpha * numarray.dot(a, b)
    t += time.time()
    print t, 2e-9 * n**3 / t

Figure 9: Python code implementing DGEMM benchmark.


